(54) Amplitude control for speech synthesis

(57) A speech synthesizing method which synthesizes speech
naturally is disclosed. A standardized frame power value of
an n-th frame is calculated with the frame power values at
the head and tail frames in a phoneme standardized. An
average value of the power values sampled from the power
frequency characteristics in the n-th frame at a
predetermined frequency interval is set as a mean frame
power value. A sum of squares of signal levels in one frame
of a frequency signal from a sound source is calculated as
a frame power correction value. A speech envelope signal is
calculated as a function having variables of the
standardized frame power value, the frame power correction
value and the mean frame power value. The amplitude level
of a speech waveform signal supplied from a vocal tract
filter is adjusted according to the level of the speech
envelope signal.
Description

1. FIELD OF THE INVENTION 

[0001] The present invention relates to a speech 
synthesis method for artificially generating speech 
waveform signals. 

2. BACKGROUND OF THE RELATED ART 

[0002] Speech waveforms of natural speech can be
expressed by connecting basic units which are made by
continuously connecting phonemes, one vowel (V) and
one consonant (C), in a form such as "CV", "CVC" or
"VCV".

[0003] Accordingly, a conversation can be created 
by means of synthetic speech by processing and regis- 
tering such phonemes as data (phoneme data) in 
advance, reading out phoneme data corresponding to a 
conversation from the registered phoneme data in 
sequence, and generating sounds corresponding to 
respective read-out phoneme data. 
[0004] To create a database based on the above-mentioned
phoneme data, firstly, a given document is read by a
person, and his/her speech is recorded. Then, speech
signals reproduced from the recorded speech are divided
into the above-mentioned phonemes. Various data indicative
of these phonemes are registered as phoneme data. Then, in
order to synthesize the speech, the respective speech data
are connected and output as continuous speech.

[0005] However, the connected phonemes are segmented
from separately recorded speech. Hence, irregularities
exist in the vocal power with which the phonemes are
uttered. Therefore, a problem arises in that the
synthesized speech is unnatural when the uttered phonemes
are merely connected together.
[0006] An object of the present invention is to provide
a speech synthesizing method for generating natural
sounding synthetic speech.

SUMMARY OF THE INVENTION 

[0007] It is an object of the present invention to pro- 
vide a method for synthesizing speech with an appara- 
tus comprising a sound source for generating a 
frequency signal, a vocal tract filter for generating 
speech waveform signals by filtering the frequency sig- 
nal with filter characteristics corresponding to a linear 
predictive coefficient based on respective phonemes. 
[0008] In one aspect of the invention, a method
comprises the steps of: dividing said phonemes into a
plurality of frames having a predetermined time length,
summing squares of speech samples in one of said plurality
of frames for each frame as a frame power value,
standardizing frame power values at head and tail frames
in one phoneme to predetermined values, respectively, to
obtain a frame power value of an n-th frame, summing
squares of signal levels of a frame in said frequency
signal to obtain a frame power correction value, providing
a speech envelope signal by means of a function having
variables of said standardized frame power values and said
frame power correction value, and adjusting an amplitude
level of said speech waveform signal as a function of the
speech envelope signal.
[0009] As described above, the levels of the head and
tail portions of respective phonemes are always maintained
at predetermined levels without substantially deforming
the synthesized speech waveform. Therefore, phonemes are
connected together smoothly so that natural sounding
synthesized speech can be generated.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0010] The aforementioned aspects and other features of
the invention are explained in the following description,
taken in connection with the accompanying drawing figures
wherein:

Fig. 1 is a block diagram showing a speech synthesis
apparatus according to the present invention,
Fig. 2 is a block diagram showing an apparatus for
generating phoneme data and speech synthesis parameters,
Fig. 3 is a flow chart showing steps for generating
phoneme data,
Fig. 4 is a view showing a memory map in a memory 33,
Fig. 5 is a flow chart showing steps for calculating
speech synthesis parameters, and
Fig. 6 is a view showing a speech synthesis control
routine based on a speech synthesis method of the
present invention.

DETAILED DESCRIPTION OF THE INVENTION 

[0011] Fig. 1 is a block diagram showing a text
speech synthesis device for reading a given document
(text) by synthesizing the speech by means of a method
according to the present invention.
[0012] In Fig. 1, a text analyzing circuit 21 generates
intermediate language character string information
including information such as accents and phrases
peculiar to respective languages in a character string
based on inputted text signals. The text analyzing circuit
21 then supplies intermediate language character string
signals CL corresponding to the above information to a
speech synthesis control circuit 22.
[0013] A phoneme data memory 20, a RAM (Random Access
Memory) 27, and a ROM (Read Only Memory) 28 are connected
to the speech synthesis control circuit 22.

[0014] The phoneme data memory 20 stores phoneme data
corresponding to various phonemes which have been sampled
from actual human voice, and speech synthesizing
parameters (standardized frame power values and mean frame
power values) used for the speech synthesis.

[0015] A sound source module 23 is provided with a
pulse generator 231 for generating impulse signals having
a frequency corresponding to a pitch frequency designating
signal K supplied from the speech synthesis control
circuit 22, and a noise generator 232 for generating noise
signals carrying an unvoiced sound. The sound source
module 23 selects either the impulse signal or the noise
signal in response to a sound source selection signal Sv
supplied from the speech synthesis control circuit 22. The
sound source module 23 then supplies the selected signal
as a frequency signal Q to a vocal tract filter 24.
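
As a rough illustration of the sound source module 23, the following Python sketch produces one frame of the frequency signal Q, either an impulse train at the designated pitch or a noise signal. The function name, the unit impulse amplitude, the uniform noise distribution, and the sample rate parameter are assumptions made for illustration only; the description does not specify them.

```python
import random


def sound_source_frame(voiced, pitch_hz, sample_rate_hz, frame_len, seed=None):
    """Return one frame of the frequency signal Q (illustrative sketch).

    voiced=True  -> impulse train with period sample_rate_hz / pitch_hz
    voiced=False -> uniform noise in [-1, 1] (distribution is an assumption)
    """
    if voiced:
        # Pitch period in samples, at least one sample long.
        period = max(1, round(sample_rate_hz / pitch_hz))
        return [1.0 if i % period == 0 else 0.0 for i in range(frame_len)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(frame_len)]
```

Here the `voiced` flag plays the role of the sound source selection signal Sv, and `pitch_hz` that of the pitch frequency designating signal K.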
[0016] The vocal tract filter 24 may include an FIR
(Finite Impulse Response) digital filter, for example. The
vocal tract filter 24 filters a frequency signal Q supplied
from the sound source module 23 with a filtering
coefficient corresponding to a linear predictive code
signal LP supplied from the speech synthesis control
circuit 22, thereby generating a speech waveform signal Vf.
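
A minimal sketch of an FIR filter of the kind mentioned above is given below in Python. It only shows the direct-form convolution; how the filter taps are derived from the linear predictive code signal LP is not stated in the description, so the `taps` argument is a hypothetical input chosen for illustration.

```python
def fir_filter(signal, taps):
    """Direct-form FIR filter: y[n] = sum_k taps[k] * signal[n - k].

    Samples before the start of `signal` are treated as zero.
    """
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out
```

Feeding a single unit impulse through the filter returns the taps themselves, which is a quick way to check the implementation.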
[0017] An amplitude adjustment circuit 25 generates an
amplitude adjustment waveform signal VAUD by adjusting the
amplitude of a speech waveform signal Vf to a level based
on a speech envelope signal Vm supplied from the speech
synthesis control circuit 22. The amplitude adjustment
circuit 25 then supplies the amplitude adjustment waveform
signal VAUD to a speaker 26. The speaker 26 generates an
acoustic output corresponding to the amplitude adjustment
waveform signal VAUD. That is, the speaker 26 outputs
speech that reads out the input text signals, as explained
hereinafter.

[0018] A method will be described hereinafter for
generating the above-mentioned phoneme data and speech
synthesis parameters stored in the phoneme data memory 20.

[0019] Fig. 2 is a block diagram showing an apparatus
for generating speech synthesis parameters.
[0020] In Fig. 2, a speech recorder 32 records human
speech received by a microphone 31. The speech recorder 32
supplies speech signals reproduced from the recorded
speech to a phoneme data generating device 30.
[0021] The phoneme data generating device 30
sequentially samples a speech signal supplied from the
speech recorder 32 to generate speech samples. The phoneme
data generating device 30 then stores the speech samples
in a predetermined domain in a memory 33. The phoneme data
generating device 30 then executes steps for generating
phoneme data, as shown in Fig. 3.
[0022] In Fig. 3, the phoneme data generating device 30
reads out speech samples stored in the memory 33 in
sequence. The phoneme data generating device 30 then
divides the series of speech samples into phonemes such as
"VCV" (step S1).
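
The VCV division of step S1 operates on recorded speech, but its unit boundaries can be illustrated on a romanized string. The Python sketch below is only an approximation under stated assumptions: it treats the letters a, i, u, e, o as vowels, ignores spaces, and cuts a unit at every vowel so that neighboring units share that vowel. The function name is hypothetical.

```python
VOWELS = set("aeiou")


def segment_vcv(romaji):
    """Split a romanized phrase into CV / VCV / V units that overlap on vowels."""
    text = romaji.replace(" ", "").lower()
    vowel_pos = [i for i, ch in enumerate(text) if ch in VOWELS]
    if not vowel_pos:
        return [text] if text else []
    # Leading unit: everything up to and including the first vowel.
    units = [text[: vowel_pos[0] + 1]]
    # Middle units: from one vowel to the next vowel, both included.
    for start, end in zip(vowel_pos, vowel_pos[1:]):
        units.append(text[start : end + 1])
    # Trailing unit: from the last vowel to the end of the phrase.
    units.append(text[vowel_pos[-1] :])
    return units
```

For instance, `segment_vcv("moyorino")` yields the units mo, oyo, ori, ino and o.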
[0023] For example, a Japanese spoken phrase
"mokutekichi ni" is segmented to mo/oku/ute/eki/iti/ini/i.
The Japanese spoken phrase "moyosimono" is segmented to
mo/oyo/osi/imo/ono/o. The Japanese spoken phrase
"moyorino" is segmented to mo/oyo/ori/ino/o. The Japanese
spoken phrase "mokuhyono" is segmented to
mo/oku/uhyo/ono/o.
[0024] Subsequently, the phoneme data generating device
30 divides each segmented phoneme into frames of a
predetermined length, for example, 10 ms (step S2).
Control information including the name of the phoneme to
which each frame belongs, the frame length of the phoneme,
and the frame number is added to each divided frame. The
frame is then stored in a given domain of the memory 33
(step S3). Then, the phoneme data generating device 30
performs linear predictive coding (LPC) analysis on every
frame of the waveform of each phoneme to generate a linear
predictive coding coefficient (hereinafter called "LPC
coefficient") of order 15. The resultant coefficient is
stored in a memory domain 1 of the memory 33 as shown in
Fig. 4 (step S4). It should be noted that the resultant
LPC coefficient in step S4 is a so-called speech spectral
envelope parameter corresponding to a filter coefficient
of the vocal tract filter 24. Subsequently, the phoneme
data generating device 30 reads out the LPC coefficient in
the memory domain 1 of the memory 33, and supplies the LPC
coefficient as the phoneme data (step S5). This phoneme
data is stored in the phoneme data memory 20.
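
The description does not say how the LPC analysis of step S4 is carried out. One common approach, shown here purely as an assumption, is the autocorrelation method solved with the Levinson-Durbin recursion. In this Python sketch the returned coefficients a_1..a_p follow the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p; the function names and that sign convention are illustrative choices, not taken from the source.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the autocorrelation of one frame."""
    return [
        sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve for LPC coefficients a_1..a_order and the residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err == 0:
            break  # silent frame: leave the remaining coefficients at zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc_analyze(frame, order=15):
    """LPC coefficients of one frame (order 15, as in the description)."""
    coeffs, _ = levinson_durbin(autocorrelation(frame, order), order)
    return coeffs
```

A decaying sequence such as 0.5^n, analyzed at order 2, should give a first coefficient close to -0.5 and a second close to 0, which is a convenient sanity check.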

[0025] Then, the phoneme data generating device 
30 calculates speech synthesis parameters as shown in 
Fig. 5 on respective phonemes stored in the memory 
33. 

[0026] In Fig. 5, the phoneme data generating device 30
calculates the sum of all squares of speech sample values
in each frame in one phoneme that is subject to processing
(hereinafter called "subject phoneme") in order to
generate a speech power of the frame. Then, as shown in
Fig. 4, the speech power is stored in a memory domain 2 of
the memory 33 as a frame power Pc (step S12).
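
The frame division of step S2 and the frame power of step S12 can be sketched in Python as follows. The sample rate parameter and the function names are assumptions for illustration; the description only fixes the 10 ms example frame length and the sum-of-squares definition.

```python
def split_frames(samples, sample_rate_hz, frame_ms=10):
    """Divide a phoneme's samples into consecutive frames of frame_ms each.

    The last frame may be shorter if the length is not a multiple of the
    frame size; how the source handles that case is not specified.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i : i + frame_len] for i in range(0, len(samples), frame_len)]


def frame_power(frame):
    """Frame power Pc: the sum of squares of the sample values in the frame."""
    return sum(x * x for x in frame)
```

Applying `frame_power` to each element of `split_frames(...)` gives the list of Pc values for one subject phoneme.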

[0027] Subsequently, the phoneme data generating 
device 30 stores "0" indicative of the head frame 
number in a built-in register n (not shown) (step S13). 
Then, the phoneme data generating device 30 gener- 
ates the relative position in the subject phoneme of the 
frame n indicated by the frame number stored in the 
built-in register n (step S14). The relative position is 
expressed by the following formula: 

r=(n-1)/N 

wherein, 

r: relative position, and 

N: the number of all frames in the subject phoneme. 
[0028] Then, the phoneme data generating device 30
reads out the frame power Pc in the frame n from the
memory domain 2 of the memory 33 shown in Fig. 4 (step
S15). The phoneme data generating device 30 reads out the
frame powers corresponding to the head and tail frames of
the subject phoneme as the head and tail frame powers Pa
and Pb, respectively, among the frame powers Pc in the
memory domain 2 (step S16).
[0029] Then, the phoneme data generating device 30
generates a standardized frame power Pn in the frame n
indicated by the built-in register n, by executing the
following calculation (1) using the head and tail frame
powers Pa, Pb, the frame power Pc obtained in step S15 and
the relative position r.

Pn = Pc / [(1 - r)Pa + rPb]     (1)

Then, the phoneme data generating device 30 stores
the standardized frame power Pn in a memory domain
3 of the memory 33 (step S17).
[0030] That is, the phoneme data generating device 30
generates the frame power value in the frame n when the
frame powers Pc in the head and tail frames of this
subject phoneme are set to "1".
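
Calculation (1) can be sketched in Python as below. The sketch assumes a 1-based frame index n, so that the relative position r = (n-1)/N is 0 at the head frame; this indexing choice and the function name are assumptions. Under this assumption the head frame maps exactly to 1, while the tail frame is only close to 1 because r stops at (N-1)/N.

```python
def standardized_frame_powers(frame_powers):
    """Standardized frame power Pn = Pc / [(1 - r)Pa + rPb] for every frame.

    frame_powers: list of Pc values for one subject phoneme, head frame first.
    Assumes the interpolated reference power is never zero.
    """
    total = len(frame_powers)  # N, the number of frames in the phoneme
    if total == 0:
        return []
    p_head = frame_powers[0]   # Pa
    p_tail = frame_powers[-1]  # Pb
    result = []
    for n, p_c in enumerate(frame_powers, start=1):
        r = (n - 1) / total    # relative position (1-based n is an assumption)
        reference = (1.0 - r) * p_head + r * p_tail
        result.append(p_c / reference)
    return result
```

The denominator is a straight-line interpolation between the head and tail powers, so Pn expresses each frame's power relative to that line.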

[0031] Then, the phoneme data generating device 30
reads out the LPC coefficient corresponding to the frame n
indicated by the built-in register n from the memory
domain 1 of the memory 33 shown in Fig. 4. The phoneme
data generating device 30 then generates power frequency
characteristics in the frame n based on the LPC
coefficient (step S18). Thereafter, the phoneme data
generating device 30 samples a power value from the power
frequency characteristics at every predetermined frequency
interval, and then stores the average value of these power
values as a mean frame power Gf in a memory domain 4 of
the memory 33 shown in Fig. 4 (step S19).
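
Steps S18 and S19 can be illustrated with the Python sketch below. It assumes that the power frequency characteristics are those of a unit-gain all-pole model 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, sampled at equal intervals from 0 up to (but not including) half the sampling frequency. The model, the sign convention, the number of sample points, and the function names are all assumptions; the description does not define them.

```python
import cmath
import math


def lpc_power_spectrum(lpc, num_points=64):
    """Sample |1 / A(e^{jw})|^2 at num_points equally spaced frequencies."""
    powers = []
    for i in range(num_points):
        omega = math.pi * i / num_points
        a = 1.0 + sum(
            c * cmath.exp(-1j * omega * (k + 1)) for k, c in enumerate(lpc)
        )
        # Assumes A(e^{jw}) is non-zero at every sampled frequency.
        powers.append(1.0 / abs(a) ** 2)
    return powers


def mean_frame_power(lpc, num_points=64):
    """Mean frame power Gf: average of the sampled power values."""
    powers = lpc_power_spectrum(lpc, num_points)
    return sum(powers) / len(powers)
```

With no coefficients the spectrum is flat and the mean is 1; a single coefficient of -0.5 gives a power of 4 at zero frequency.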

[0032] Then, the phoneme data generating device 30 adds
"1" to the frame number n stored in the built-in register
n, and stores the result in the built-in register n as the
new frame number n, replacing the previous one (step S20).
Subsequently, the phoneme data generating device 30
determines whether the frame number stored in the built-in
register n equals (N-1) (step S21).

[0033] In step S21, if the frame number stored in the
built-in register n does not equal (N-1), the phoneme data
generating device 30 returns to the step S14, and repeats
the above-mentioned operation. Such an operation stores
the standardized frame power Pn and the mean frame power
Gf corresponding to each of the head frame to the (N-1)th
frame of a subject phoneme in the memory domains 3 and 4,
as shown in Fig. 4.
[0034] In the step S21, if the frame number stored in
the built-in register n equals (N-1), the phoneme data
generating device 30 respectively reads out the
standardized frame power Pn and the mean frame power Gf
stored in the memory domains 3 and 4 of the memory 33
shown in Fig. 4, and outputs the standardized frame power
Pn and the mean frame power Gf (step S23). The
standardized frame power Pn and the mean frame power Gf
are stored in the phoneme data memory 20 as speech
synthesis parameters.

[0035] That is, the respective phoneme data obtained by
the procedure shown in Fig. 3 is associated with the
standardized frame power Pn and the mean frame power Gf
obtained by the procedure shown in Fig. 5, and the
resultant data is stored in the phoneme data memory 20.

[0036] The speech synthesis control circuit 22 shown in
Fig. 1 receives the phoneme data and speech synthesis
parameters corresponding to the intermediate language
character string signals CL from the text analyzing
circuit 21, by using software stored in the ROM 28. The
speech synthesis control circuit 22 then controls speech
synthesis as explained hereinafter.
[0037] The speech synthesis control circuit 22 divides
segments of the intermediate language character string
signals CL into phonemes consisting of "VCV", and then
receives the phoneme data corresponding to respective
phonemes from the phoneme data memory 20 sequentially. The
speech synthesis control circuit 22 then supplies a pitch
frequency designation signal K for designating the pitch
frequency to the sound source module 23. Then, the speech
synthesis control circuit 22 synthesizes the speech for
the respective phoneme data in the order in which they are
read from the phoneme data memory 20.

[0038] Fig. 6 shows a speech synthesizing control 
procedure. 

[0039] In Fig. 6, the speech synthesis control circuit
22 selects the data for one phoneme to be processed
(hereinafter called "subject phoneme data") in the
received order as mentioned above. The speech synthesis
control circuit 22 then stores "0" indicative of the head
frame number in the phoneme data in the built-in register
n (not shown) (step S101). Subsequently, the speech
synthesis control circuit 22 supplies a sound source
selection signal Sv to the sound source module 23 (step
S102). The sound source selection signal Sv indicates
whether the phoneme corresponding to the above-mentioned
subject phoneme data is a voiced sound or an unvoiced
sound. Depending on the sound source selection signal Sv,
the sound source module 23 generates as a frequency signal
Q one of a noise signal and an impulse signal having a
frequency designated by the pitch frequency designation
signal K.

[0040] Subsequently, the speech synthesis control
circuit 22 samples the frequency signal Q supplied from
the sound source module 23 at every predetermined
interval. The control circuit 22 then calculates the sum
of squares of respective sample values in a frame to
generate a frame power correction value Gs. Then, the
speech synthesis control circuit 22 stores the frame power
correction value Gs in a built-in register G (not shown)
(step S103). Then, the speech synthesis control circuit 22
supplies the LPC coefficient to the vocal tract filter 24
as the linear predictive coding signal LP (step S104). It
is noted that the LPC coefficient corresponds to the frame
n indicated by the built-in register n in the subject
phoneme data. Then, the speech synthesis control circuit
22 reads out the standardized frame power Pn and the mean
frame power Gf corresponding to the frame n indicated by
the above-mentioned built-in register n in the subject
phoneme data from the phoneme data memory 20 (step S105).
Thereafter, the speech synthesis control circuit 22
calculates a speech envelope signal Vm, by the following
computation with the standardized frame power Pn, the mean
frame power Gf, and the frame power correction value Gs
stored in the built-in register G. The speech synthesis
control circuit 22 then supplies the speech envelope
signal Vm to an amplitude adjustment circuit 25 (step
S106).




[0041] By means of the step S106, the amplitude
adjustment circuit 25 adjusts the amplitude of the speech
waveform signal Vf supplied from the vocal tract filter 24
to a level corresponding to the above-mentioned speech
envelope signal Vm. Since the connecting portions of
respective phonemes are always maintained at a
predetermined level through this amplitude adjustment, the
connection of phonemes becomes smooth and hence, natural
sounding synthesized speech is produced.
[0042] Subsequently, the speech synthesis control
circuit 22 determines whether the frame number n stored in
the built-in register n is smaller than the total number
of frames N in the subject phoneme data by 1, that is,
whether the frame number n equals (N-1) (step S107). In
the step S107, if it is determined that n does not equal
(N-1), the speech synthesis control circuit 22 adds "1" to
the frame number stored in the built-in register n, and
stores this value as a new frame number in the built-in
register n by substitution (step S108). After the step
S108, the speech synthesis control circuit 22 returns to
the step S103, and then repeats the above-mentioned
operation.

[0043] On the other hand, in step S107, if it is
determined that the frame number n stored in the built-in
register n equals (N-1), the speech synthesis control
circuit 22 returns to the step S101, and repeats the
phonemic synthesis process for the next phoneme data in
the same manner.
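
The per-frame control flow of steps S101 to S108 can be summarized in the Python sketch below. It is a sketch under several assumptions: the vocal tract filter and the envelope computation are passed in as caller-supplied functions (`vocal_tract_filter` and `envelope_fn` are hypothetical names, and no particular formula for Vm is assumed), and the amplitude adjustment is modeled as a simple multiplication of each sample of Vf by Vm.

```python
def synthesize_phoneme(source_frames, lpc_per_frame, pn, gf,
                       vocal_tract_filter, envelope_fn):
    """Synthesize one phoneme frame by frame (illustrative sketch).

    source_frames: frames of the frequency signal Q, one per frame n
    lpc_per_frame: LPC coefficients for each frame n
    pn, gf:        standardized frame powers Pn and mean frame powers Gf
    """
    output = []
    for n, q in enumerate(source_frames):              # S101, S107, S108
        gs = sum(x * x for x in q)                     # S103: correction value Gs
        vf = vocal_tract_filter(q, lpc_per_frame[n])   # S104: waveform Vf
        vm = envelope_fn(pn[n], gf[n], gs)             # S105, S106: envelope Vm
        # Amplitude adjustment, assumed here to be a plain scaling by Vm.
        output.extend(vm * x for x in vf)
    return output
```

Calling this function once per phoneme, in the order the phoneme data is read, corresponds to the return to step S101 for the next phoneme data.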

[0044] The present invention has been explained
heretofore in conjunction with the preferred embodiment.
However, it should be understood that those skilled in the
art could easily conceive various other embodiments and
modifications and that such embodiments and modifications
fall within the scope of the appended claims.



Claims 

1. A method for synthesizing speech with an apparatus
comprising a sound source for generating a frequency
signal, a vocal tract filter for filtering said
frequency signal to generate a speech waveform
signal, said filter having characteristics corresponding
to a linear predictive coefficient calculated from
respective phonemes in a phoneme series, comprising
the steps of:

dividing said phonemes into a plurality of 
frames having a predetermined time length, 
summing squares of speech samples in one of 
said plurality of frames for each frame as a 
frame power value, 

standardizing frame power values at head and 
tail frames in one phoneme to predetermined 
values, respectively, to obtain a frame power 
value of an n-th frame, 

summing squares of signal levels of a frame in 
said frequency signal to obtain a frame power 
correction value, and 

providing a speech envelope signal by means 
of a function having variables of said standard- 
ized frame power values and said frame power 
correction value, and adjusting an amplitude 
level of said speech waveform signal as a func- 
tion of the speech envelope signal. 

2. A method according to claim 1, further comprising:

providing power frequency characteristics 
based on said linear predictive coefficient cor- 
responding to said n-th frame, 
calculating an average value of power values 
sampled from said power frequency character- 
istics at a predetermined frequency interval as 
a mean frame power value, 
calculating a speech waveform signal by 
means of a function having variables of said 
standardized frame power value, said frame 
power correction value and said mean frame 
power value, and 

adjusting an amplitude of said speech wave- 
form signal as a function of said speech enve- 
lope signal. 

3. A method according to claim 2, wherein said function
is expressed as:



wherein Pn is said standardized frame power value,
Gs is said frame power correction value, and Gf is
said mean frame power value.

4. A method according to claim 1, wherein said frequency
signal includes an impulse signal carrying a voiced sound
and a noise signal carrying an unvoiced sound.